BeautifulSoup, a Handy Scraping Tool: Basic Use of CSS Selectors


2024-01-06 08:38 | Source: compiled from the web | Views: 265

1. Introduction to Beautiful Soup

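Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.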

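Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.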
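As a minimal sketch of the case described above, suppose the input arrives as GBK-encoded bytes with no charset declaration (the byte string below is made up for illustration). Passing `from_encoding` states the original encoding; the parsed tree then holds Unicode text, and encoding a tag produces UTF-8 bytes by default.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes that carry no charset declaration.
raw = '<p>爬虫</p>'.encode('gbk')

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.p.string          # text inside the tree is Unicode
utf8_bytes = soup.p.encode()  # Tag.encode() uses UTF-8 unless told otherwise

print(text)
print(soup.original_encoding)  # the encoding Beautiful Soup ended up using
print(utf8_bytes)
```

When the bytes do declare their encoding, or detection succeeds, the `from_encoding` argument can simply be left out.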

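Beautiful Soup works on top of Python parsers such as lxml and html5lib (as well as the built-in html.parser), so you can choose between different parsing strategies or trade flexibility for speed.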

2. Basic use of CSS selectors in BeautifulSoup

2.1 Preparing a piece of HTML

I copied some HTML from the Baidu home page to use as an example. Save the following code in the same directory as your script, in a file named test.html:

(The markup of this listing did not survive; only its text content remains.)

    practice BeautifulSoup
    新闻 hao123 地图 直播 视频 贴吧 学术
    #苏炳添有望圆梦奥运奖牌#
    小学生为要偶像签名被骗19100元
    40秒回顾英仙座流星雨划过天际
    奥运接力银牌得主被停赛
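Since the markup itself is not reproduced here, the following is only a hypothetical reconstruction, inferred from the selectors and results used in the rest of this article. The ids, the `mnav1` class and the news, hao123 and haokan links follow those results; the class name `mnav` and the remaining `href` values are assumptions chosen to be consistent with them. Running it writes a `test.html` that the later examples can be tried against.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for the article's test.html (NOT the original file).
# Class "mnav" and the hrefs for 地图, 直播, 贴吧 and 学术 are assumptions.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>practice BeautifulSoup</title>
</head>
<body>
<div id="wrapper">
    <div id="s-top-left">
        <a href="http://news.baidu.com" class="mnav1">新闻</a>
        <a href="https://www.hao123.com" class="mnav">hao123</a>
        <a href="http://map.baidu.com" class="mnav">地图</a>
        <a href="https://live.baidu.com/" class="mnav">直播</a>
        <a href="https://haokan.baidu.com/?sfrom=baidu-top" class="mnav1">视频</a>
        <a href="http://tieba.baidu.com" class="mnav">贴吧</a>
        <a href="http://xueshu.baidu.com" class="mnav">学术</a>
    </div>
    <ul id="hotsearch-content-wrapper">
        <li>
            <span>#苏炳添有望圆梦奥运奖牌#</span>
        </li>
        <li>
            <span>小学生为要偶像签名被骗19100元</span>
        </li>
        <li>
            <span>40秒回顾英仙座流星雨划过天际</span>
        </li>
        <li>
            <span>奥运接力银牌得主被停赛</span>
        </li>
    </ul>
</div>
</body>
</html>
"""

# Write the file next to the script so the article's open('test.html') finds it.
Path('test.html').write_text(SAMPLE_HTML, encoding='utf-8')

# Quick sanity check against one result quoted later in the article (3.9).
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
https_links = [a.string for a in soup.select('a[href^="https"]')]
print(https_links)
```

Because this file is a reconstruction, printed tags may differ in detail from what the author saw, even where the matched elements are the same.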


2.2 Loading the HTML and creating the object

    from bs4 import BeautifulSoup

    # assumes test.html was saved as UTF-8
    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

This reads test.html, specifies html.parser as the parser, and turns the HTML text into a bs4.BeautifulSoup object. All the operations that follow use this object's select method to extract information.
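As a small self-contained sketch of the select method mentioned above (the HTML string is made up so the snippet runs without test.html): `select` returns every match, while the related `select_one` returns only the first match, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up fragment, used only for illustration.
html = '<div><p class="intro">first</p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

all_p = soup.select('p')            # every <p> in document order
first_p = soup.select_one('p')      # only the first <p>
missing = soup.select_one('table')  # no match

print(len(all_p))
print(first_p.string)
print(missing)
```

`select_one` saves indexing into the result when only one element is expected, but the `None` case has to be handled before accessing attributes on it.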

3. Basic usage

3.1 Selecting a tag directly

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('title')
    for item in items:
        print(item.name)
        print(item.string)

    # Output:
    # title
    # practice BeautifulSoup

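Take the title tag as an example: passing the tag name as the argument extracts the title tag from the document. select returns a bs4.element.ResultSet, which behaves like a list. Each element in it is a bs4.element.Tag object; its name attribute gives the tag name, and its string attribute (an attribute, not a method) gives the tag's text.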
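One caveat about `string` is worth a short sketch (the fragment below is made up for illustration): it only returns text when the tag has a single child. For a tag with mixed content it is `None`, and `get_text()` is the usual way to collect all the text.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one plain <li>, one <li> with mixed content.
html = '<ul><li>plain</li><li>mixed <b>bold</b> text</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

items = soup.select('li')

names = [li.name for li in items]        # tag names
strings = [li.string for li in items]    # None when a tag has several children
texts = [li.get_text() for li in items]  # concatenates all nested text

print(names)
print(strings)
print(texts)
```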

3.2 Selecting a tag by id

In CSS, a tag is selected by id by putting a # in front of the id. For example, to select the tag whose id is s-top-left:

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('#s-top-left')
    print(items)

    # Output (tags omitted):
    # [
    # 新闻
    # hao123
    # 地图
    # 直播
    # 视频
    # 贴吧
    # 学术
    # ]
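Note that `select` returns a list even when the id is unique. A minimal sketch, on a made-up fragment with hypothetical `href` values, of taking the single match out of that list and then calling `select` on it to search only inside that element:

```python
from bs4 import BeautifulSoup

# Made-up fragment; the hrefs are placeholders.
html = """
<div id="s-top-left"><a href="/a">A</a><a href="/b">B</a></div>
<div id="footer"><a href="/c">C</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

matches = soup.select('#s-top-left')  # a list, even for a unique id
nav = matches[0]                      # the Tag itself

# Tag objects also have select(), which searches only their descendants.
nav_links = [a['href'] for a in nav.select('a')]
all_links = [a['href'] for a in soup.select('a')]

print(nav_links)
print(all_links)
```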

To select specifically the div tag whose id is s-top-left, put div in front of the #, as in the code below. The result is the same as above.

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('div#s-top-left')
    print(items)

3.3 Selecting tags by class, and getting a tag's text and attribute values

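To select by class, put a . in front of the class name and pass it to select; this matches every tag that has that class. Here we select the a tags whose class is mnav1: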

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('a.mnav1')
    for item in items:
        print(item)               # each a tag
        print(item.string)        # the tag's text
        print(item.attrs)         # all attributes of the tag
        print(item.get('class'))  # the value of one attribute
        print()

    # Output (tags omitted from the printed a tags):
    # 新闻
    # 新闻
    # {'href': 'http://news.baidu.com', 'class': ['mnav1']}
    # ['mnav1']
    #
    # 视频
    # 视频
    # {'href': 'https://haokan.baidu.com/?sfrom=baidu-top', 'class': ['mnav1']}
    # ['mnav1']

3.4 Selecting nested tags step by step

3.4.1 Use '>' for tags in a direct parent-child relationship

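For example, select the a tags that are children of a div which is itself a child of the tag with id wrapper. Note that adjacent tags in the expression must be parent and child: the child of the tag with id wrapper is a div, and its grandchild is the a tag.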

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('#wrapper > div > a')
    for item in items:
        print(item)

    # Output (tags omitted):
    # 新闻
    # hao123
    # 地图
    # 直播
    # 视频
    # 贴吧
    # 学术

3.4.2 Use a space for tags that are not in a direct parent-child relationship

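For example, select the span tags inside li tags inside the body tag. Here body and li are not direct parent and child, but li is a descendant of body, so a space is enough: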

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('body li span')
    for item in items:
        print(item)

    # Output (tags omitted):
    # #苏炳添有望圆梦奥运奖牌#
    # 小学生为要偶像签名被骗19100元
    # 40秒回顾英仙座流星雨划过天际
    # 奥运接力银牌得主被停赛

3.5 Selecting tags that have an href attribute

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('[href]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # 新闻
    # hao123
    # 地图
    # 直播
    # 视频
    # 贴吧
    # 学术

3.6 Selecting several tags at once

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('div#s-top-left, ul#hotsearch-content-wrapper')
    for item in items:
        print(item)

    # Output (tags omitted):
    # 新闻
    # hao123
    # 地图
    # 直播
    # 视频
    # 贴吧
    # 学术
    #
    # #苏炳添有望圆梦奥运奖牌#
    # 小学生为要偶像签名被骗19100元
    # 40秒回顾英仙座流星雨划过天际
    # 奥运接力银牌得主被停赛

3.7 Selecting a tags that have an href attribute

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('a[href]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # 新闻
    # hao123
    # 地图
    # 直播
    # 视频
    # 贴吧
    # 学术

3.8 Selecting tags by a specific attribute value

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('[href="https://haokan.baidu.com/?sfrom=baidu-top"]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # 视频

3.9 Selecting a tags whose href starts with https

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('a[href^="https"]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # hao123
    # 直播
    # 视频

3.10 Selecting a tags whose href ends with hao123.com

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('a[href$="hao123.com"]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # hao123

3.11 Selecting a tags whose href contains 'www'

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('a[href*="www"]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # hao123

3.12 Selecting a tags that have a class attribute

    from bs4 import BeautifulSoup

    with open('test.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    items = soup.select('a[class]')
    for item in items:
        print(item)

    # Output (tags omitted):
    # 新闻
    # hao123
    # 地图
    # 直播
    # 视频
    # 贴吧
    # 学术

4. Closing
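As a compact recap of sections 3.4 to 3.12, here is a sketch on a made-up fragment (the ids, classes and URLs below are illustrative and not taken from test.html) that puts the combinators and attribute operators side by side:

```python
from bs4 import BeautifulSoup

# Made-up fragment, used only to contrast the selectors.
html = """
<div id="box">
  <a class="nav" href="https://www.example.com">one</a>
  <p><a href="http://example.org/page.html">two</a></p>
  <a name="anchor">three</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def texts(selector):
    """Return the text of every tag matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


results = {
    'child': texts('#box > a'),       # direct children of #box only
    'descendant': texts('#box a'),    # any depth below #box
    'has_href': texts('a[href]'),     # attribute present
    'exact': texts('a[href="http://example.org/page.html"]'),
    'prefix': texts('a[href^="https"]'),   # value starts with
    'suffix': texts('a[href$=".html"]'),   # value ends with
    'contains': texts('a[href*="www"]'),   # value contains
    'has_class': texts('a[class]'),
    'either': texts('a.nav, a[name]'),     # comma: match any of the selectors
}

for label, matched in results.items():
    print(label, matched)
```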

If you find any mistakes, corrections are welcome!
